{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluation Guide for TAP-B\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "We provide three simple scripts to evaluate **TAP** on *Instance Segmentation*, *Instance Classification* and *Region Caption*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Necessary datasets, imports and models for evaluation.\n",
    "\n",
    "```\n",
    "datasets\n",
    "|_ coco\n",
    "|  |_ train2017\n",
    "|  |  |_ 000000000009.jpg\n",
    "|  |  |_ ...\n",
    "|  |_ val2017\n",
    "|  |  |_ 000000000139.jpg\n",
    "|  |  |_ ...\n",
    "|  |_ annotations\n",
    "|  |  |_ coco_instances_val2017.json\n",
    "|  |_ results\n",
    "|  |  |_ coco_instances_val2017_vitdet_h_cascade.json  # Run detectron2 to generate this file.\n",
    "|- lvis\n",
    "|  |_ annotations\n",
    "|  |  |_ lvis_val_v1.json\n",
    "|  |_ results\n",
    "|  |  |_ lvis_val_v1_vitdet_h_cascade.json  # Run detectron2 to generate this file.\n",
    "|_ vg\n",
    "|  |_ images\n",
    "|  |  |_ 1.jpg\n",
    "|  |  |_ ...\n",
    "|  |_ annotations\n",
    "|  |  |_ test.json  # https://datarelease.blob.core.windows.net/grit/VG_preprocessed_annotations/test.json\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"..\")\n",
    "\n",
    "class AttrDict(dict):\n",
    "    def __init__(self, *args, **kwargs):\n",
    "        super(AttrDict, self).__init__(*args, **kwargs)\n",
    "        self.__dict__ = self\n",
    "\n",
    "# global arguments.\n",
    "args = AttrDict(read_every=100, prompt_size=256)\n",
    "args.model_type = \"tap_vit_b\"\n",
    "args.checkpoint = \"../models/tap_vit_b_b45cbf.pkl\"\n",
    "args.device = [0, 1, 2, 3, 4, 5, 6, 7]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation: Instance Segmentation on COCO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "92857 instances in 5000 images.\n",
      "im_process: 5000/5000 [0.064s + 0.013s] (eta: 0:00:00)\n",
      "Writing segmentations to /data/workspace/models/tokenize-anything/scripts/../outputs/coco_segmentations.json\n",
      "\n",
      "Evaluating COCO segmentations...\n",
      "loading annotations into memory...\n",
      "Done (t=0.46s)\n",
      "creating index...\n",
      "index created!\n",
      "Loading and preparing results...\n",
      "DONE (t=1.58s)\n",
      "creating index...\n",
      "index created!\n",
      "Running per image evaluation...\n",
      "Evaluate annotation type *segm*\n",
      "DONE (t=25.70s).\n",
      "Accumulating evaluation results...\n",
      "DONE (t=3.65s).\n",
      "Summary:\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449\n",
      " Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.711\n",
      " Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.479\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.280\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.499\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.609\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.339\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.537\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.559\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.416\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.616\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.689\n"
     ]
    }
   ],
   "source": [
    "from scripts.eval_seg import main\n",
    "\n",
    "args.images_dir = \"../datasets/coco/val2017\"\n",
    "args.gt_json_file = \"../datasets/coco/annotations/coco_instances_val2017.json\"\n",
    "args.det_json_file = \"../datasets/coco/results/coco_instances_val2017_vitdet_h_cascade.json\"\n",
    "main(args)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation: Instance Segmentation on LVIS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3293288 instances in 19809 images.\n",
      "im_process: 19809/19809 [0.111s + 0.125s] (eta: 0:00:00)\n",
      "Writing segmentations to /data/workspace/models/tokenize-anything/scripts/../outputs/lvis_segmentations.json\n",
      "\n",
      "Evaluating LVIS segmentations...\n",
      "Summary:\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=all] = 0.417\n",
      " Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=300 catIds=all] = 0.607\n",
      " Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=300 catIds=all] = 0.442\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=     s | maxDets=300 catIds=all] = 0.289\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=     m | maxDets=300 catIds=all] = 0.545\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=     l | maxDets=300 catIds=all] = 0.637\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  r] = 0.331\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  c] = 0.427\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  f] = 0.442\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=all] = 0.504\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=     s | maxDets=300 catIds=all] = 0.361\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=     m | maxDets=300 catIds=all] = 0.631\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=     l | maxDets=300 catIds=all] = 0.698\n"
     ]
    }
   ],
   "source": [
    "from scripts.eval_seg import main\n",
    "\n",
    "args.images_dir = \"../datasets/coco/val2017\"\n",
    "args.gt_json_file = \"../datasets/lvis/annotations/lvis_val_v1.json\"\n",
    "args.det_json_file = \"../datasets/lvis/results/lvis_val_v1_vitdet_h_cascade.json\"\n",
    "main(args)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation: Instance Classification on LVIS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "244707 instances in 19809 images.\n",
      "im_process: 19809/19809 [0.050s + 0.021s] (eta: 0:00:00)\n",
      "Writing detections to /data/workspace/models/tokenize-anything/scripts/../outputs/lvis_detections.json\n",
      "\n",
      "Evaluating LVIS detections...\n",
      "Summary:\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=all] = 0.564\n",
      " Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=300 catIds=all] = 0.568\n",
      " Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=300 catIds=all] = 0.565\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=     s | maxDets=300 catIds=all] = 0.422\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=     m | maxDets=300 catIds=all] = 0.695\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=     l | maxDets=300 catIds=all] = 0.806\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  r] = 0.556\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  c] = 0.555\n",
      " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=  f] = 0.577\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 catIds=all] = 0.646\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=     s | maxDets=300 catIds=all] = 0.478\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=     m | maxDets=300 catIds=all] = 0.764\n",
      " Average Recall     (AR) @[ IoU=0.50:0.95 | area=     l | maxDets=300 catIds=all] = 0.871\n"
     ]
    }
   ],
   "source": [
    "from scripts.eval_cls import main\n",
    "\n",
    "args.images_dir = \"../datasets/coco/val2017\"\n",
    "args.gt_json_file = \"../datasets/lvis/annotations/lvis_val_v1.json\"\n",
    "args.concept = \"../concepts/lvis_1203.pkl\"\n",
    "args.max_dets = 300\n",
    "main(args)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation: Region Caption on Visual Genome"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "232935 instances in 5000 images.\n",
      "im_process: 5000/5000 [0.157s] (eta: 0:00:00)\n",
      "Evaluating captions...\n",
      "Bleu [0.36803494977180123, 0.23047192753065446, 0.16138780380493958, 0.11938476662571457]\n",
      "METEOR 0.17416393195210034\n",
      "Rouge 0.354880694034182\n",
      "CIDEr 1.4921442501120745\n"
     ]
    }
   ],
   "source": [
    "from scripts.eval_cap import main\n",
    "\n",
    "args.images_dir = \"../datasets/vg/images\"\n",
    "args.gt_json_file = \"../datasets/vg/annotations/test.json\"\n",
    "main(args)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
